WEEK 5: SPECIAL DATA TYPES

Monday, February 6th

Today we will…

  • Final Project Group Formation
  • Review Lab 4
  • Mini lecture on text material
    • Reordering Factor Variables
    • Working with Date & Time Variables
  • PA 5.1: Zodiac Killer

Final Project Group Formation

You will be completing a final project in Stat 331/531 in teams of four. More details to come soon!

Upcoming submissions to be aware of:

  • Group Formation Survey due Friday, 2/10 at 11:59pm
    • Meant to help me gather information about your preferences and work style in order to facilitate the team formation process.
    • Your groupmates do not need to be in the same section as you, but you might find it useful for worktime during class.
  • Group Contracts
    • Let’s have an open conversation with our team to make things go smoothly
  • Project Proposal – I will provide details on this!
  • Final Project Deliverable
    • Might be broken up into two steps since I know we like to procrastinate!

Lab 4: Tips

Don’t include pages of output

Use the code chunk option

#| output: false

#| results: false


Data Description Components

Make sure to include context when describing the data set as well as the data characteristics.

  • Where did the data come from? What years? Location? Source?
  • What is the data being used for?
  • What are the variables (in context) and observations (in context)?

mutate() vs summarise()
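
A minimal sketch of the difference, using a small made-up tibble (the prices data and its column names are hypothetical, not from Lab 4): mutate() keeps every row and adds a column, while summarise() collapses each group to a single row.

```r
library(dplyr)

# Hypothetical toy data (not from Lab 4)
prices <- tibble(
  city  = c("A", "A", "B", "B"),
  price = c(1, 2, 3, 5)
)

# mutate(): one output row per input row; the group mean is repeated
with_mean <- prices |>
  group_by(city) |>
  mutate(avg_price = mean(price)) |>
  ungroup()

# summarise(): one output row per group
city_means <- prices |>
  group_by(city) |>
  summarise(avg_price = mean(price))

nrow(with_mean)   # 4 rows, same as the input
nrow(city_means)  # 2 rows, one per city
```

If the question asks for a value attached to every observation, think mutate(); if it asks for one value per group, think summarise().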

Lab 4: Game Plans!

Read                   Think
Average                summarize(avg_var = mean())
Total                  summarize(total = sum())
Which or For Each      group_by()
Minimum                slice_min()
Maximum                slice_max()
Minimum and Maximum    arrange() |> slice(1,n())
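
As a rough illustration of the last three game plans, here is a sketch on a made-up tibble (scores and its columns are hypothetical names chosen for this example):

```r
library(dplyr)

# Hypothetical toy data for illustration
scores <- tibble(
  name  = c("a", "b", "c", "d"),
  score = c(3, 9, 1, 7)
)

# "Minimum": the row with the smallest score
lowest <- scores |> slice_min(score)

# "Maximum": the row with the largest score
highest <- scores |> slice_max(score)

# "Minimum and Maximum": sort, then keep the first and last rows
both <- scores |>
  arrange(score) |>
  slice(1, n())
```

Note that slice_min() and slice_max() keep ties by default, so they can return more than one row.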

Lab 4: Better alternatives to bar plots

Bar plots are typically reserved for displaying frequencies

# A tibble: 4 × 3
  geography     mean_price_diff sd_price_diff
  <fct>                   <dbl>         <dbl>
1 San Francisco           0.719         0.334
2 San Diego               0.685         0.211
3 Sacramento              0.578         0.270
4 Los Angeles             0.528         0.188
diff_summary  |> 
  ggplot(aes(x = mean_price_diff, 
             y = geography,
             fill = geography)
         ) +
  geom_bar(stat = "identity") +
  labs(subtitle = "Geography",
       x = "Difference in Price ($)\nOrganic - Conventional",
       y = "") +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_fill_brewer(palette = "Dark2")

Read more about Cleveland Dot Plots

diff_summary |> 
  arrange(desc(mean_price_diff)) |> 
  ggplot(aes(x = mean_price_diff, 
             y = geography,
             color = geography)
         ) +
  geom_segment(aes(xend = 0,
                   yend = geography)
  ) +
  geom_point() +
  labs(subtitle = "Geography",
       x = "Difference in Price ($)\nOrganic - Conventional",
       y = "") +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_color_brewer(palette = "Dark2")

Factor Variables

library(forcats) cheatsheet

Common tasks

  • Turn a character or numeric variable into a factor

  • Make a factor by discretizing / “binning” a numeric variable

  • Rename or reorder the levels of an existing factor

Note

The package forcats (“for categoricals”) gives nice shortcuts for wrangling categorical variables.

  • forcats loads with the tidyverse!

Create a factor

x <- c("apple", "dog", "banana", "cat", "banana", "Queen Elizabeth", "dog")
x
[1] "apple"           "dog"             "banana"          "cat"            
[5] "banana"          "Queen Elizabeth" "dog"            


x <- factor(x)
x
[1] apple           dog             banana          cat            
[5] banana          Queen Elizabeth dog            
Levels: apple banana cat dog Queen Elizabeth

What happened?
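
One way to see what factor() did is to inspect the pieces it stores. This is a sketch using base R only: the unique values become the levels (sorted, with the exact order depending on the session's locale), and each element is stored as the integer position of its level.

```r
x <- factor(c("apple", "dog", "banana", "cat", "banana", "Queen Elizabeth", "dog"))

levels(x)      # the unique values, sorted (sort order depends on the locale)
as.integer(x)  # each element stored as the position of its level
nlevels(x)     # 5 distinct levels for 7 elements

# Looking the integer codes up in the levels recovers the original strings
levels(x)[as.integer(x)]
```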

fct_recode()

new level = old level

x <- fct_recode(x,
                "fruit" = "apple",
                "fruit" = "banana",
                "pet"   = "cat",
                "pet"   = "dog"
                )
x
[1] fruit           pet             fruit           pet            
[5] fruit           Queen Elizabeth pet            
Levels: fruit pet Queen Elizabeth

Note

Notice Queen Elizabeth is a “remaining” level that was never recoded.

fct_relevel()

x <- fct_relevel(x, 
                 "Queen Elizabeth", 
                 "pet", 
                 "fruit"
)
levels(x)
[1] "Queen Elizabeth" "pet"             "fruit"          

Factors in the tidyverse

library(liver)
data(cereal)
str(cereal)
'data.frame':   77 obs. of  16 variables:
 $ name    : Factor w/ 77 levels "100% Bran","100% Natural Bran",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ manuf   : Factor w/ 7 levels "A","G","K","N",..: 4 6 3 3 7 2 3 2 7 5 ...
 $ type    : Factor w/ 2 levels "cold","hot": 1 1 1 1 1 1 1 1 1 1 ...
 $ calories: int  70 120 70 50 110 110 110 130 90 90 ...
 $ protein : int  4 3 4 4 2 2 2 3 2 3 ...
 $ fat     : int  1 5 1 0 2 2 0 2 1 0 ...
 $ sodium  : int  130 15 260 140 200 180 125 210 200 210 ...
 $ fiber   : num  10 2 9 14 1 1.5 1 2 4 5 ...
 $ carbo   : num  5 8 7 8 14 10.5 11 18 15 13 ...
 $ sugars  : int  6 8 5 0 8 10 14 8 6 5 ...
 $ potass  : int  280 135 320 330 -1 70 30 100 125 190 ...
 $ vitamins: int  25 0 25 25 25 25 25 25 25 25 ...
 $ shelf   : int  3 3 3 3 3 1 2 3 1 3 ...
 $ weight  : num  1 1 1 1 1 1 1 1.33 1 1 ...
 $ cups    : num  0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
 $ rating  : num  68.4 34 59.4 93.7 34.4 ...
cereal_casewhen <- cereal |> 
  mutate(manuf = case_when(manuf == "A" ~ "American Home Food Products", 
                           manuf == "G" ~ "General Mills", 
                           manuf == "K" ~ "Kelloggs", 
                           manuf == "N" ~ "Nabisco", 
                           manuf == "P" ~ "Post", 
                           manuf == "Q" ~ "Quaker Oats", 
                           manuf == "R" ~ "Ralston Purina"
                           ),
         manuf = as.factor(manuf)
  )
summary(cereal_casewhen$manuf)
American Home Food Products               General Mills 
                          1                          22 
                   Kelloggs                     Nabisco 
                         23                           6 
                       Post                 Quaker Oats 
                          9                           8 
             Ralston Purina 
                          8 
cereal_recode <- cereal |> 
  mutate(manuf = fct_recode(manuf, 
                             "American Home Food Products" = "A", 
                             "General Mills" = "G", 
                             "Kelloggs" = "K", 
                             "Nabisco" = "N", 
                             "Post" = "P", 
                             "Quaker Oats" = "Q", 
                             "Ralston Purina" = "R"
                           )
  )

summary(cereal_recode$manuf)
American Home Food Products               General Mills 
                          1                          22 
                   Kelloggs                     Nabisco 
                         23                           6 
                       Post                 Quaker Oats 
                          9                           8 
             Ralston Purina 
                          8 

Factors in ggplot2

Disclaimer: fix your axes and legend labels!

library(ggridges)
cereal_recode |> 
  ggplot(aes(x = sugars, 
             y = manuf, 
             fill = manuf)) +
  geom_density_ridges() +
  theme_minimal() +
  theme(legend.position = "none") +
  labs()

By default, fct_reorder() orders levels by the median of .x; the code below orders by the mean instead.

cereal_recode |> 
  ggplot(aes(x = sugars, 
             y = fct_reorder(.f = manuf, 
                             .x = sugars,
                             .fun = mean), 
             fill = manuf)
         ) +
  geom_density_ridges() +
  theme_minimal() +
  theme(legend.position = "none") +
  labs()
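
To see the effect of .fun without drawing a plot, here is a small sketch on made-up data (the values are hypothetical, chosen so that the median and the mean give different orderings):

```r
library(forcats)

# Hypothetical toy data: K has a low median but a high mean
manuf  <- factor(c("K", "K", "K", "G", "G", "G"))
sugars <- c(1, 1, 10, 2, 2, 3)

by_median <- fct_reorder(manuf, sugars)               # default .fun = median
by_mean   <- fct_reorder(manuf, sugars, .fun = mean)  # order by the mean instead

levels(by_median)  # "K" "G"  (medians: K = 1, G = 2)
levels(by_mean)    # "G" "K"  (means: G = 2.33, K = 4)
```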

Factors in ggplot2

Disclaimer: fix your axes and legend labels!

cereal_recode |> 
  group_by(manuf, shelf) |> 
  summarise(avg_sugar = mean(sugars, na.rm = TRUE)) |> 
  ggplot(aes(x = shelf, 
             y = avg_sugar, 
             color = manuf)
         ) +
  geom_line() +
  theme_minimal() +
  labs()

cereal_recode |> 
  group_by(manuf, shelf) |> 
  summarise(avg_sugar = mean(sugars, na.rm = TRUE)) |> 
  ggplot(aes(x = shelf, 
             y = avg_sugar, 
             color = fct_reorder2(manuf, .x = shelf, .y = avg_sugar))
         ) +
  geom_line() +
  theme_minimal() +
  labs()
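
A sketch of what fct_reorder2() is doing, on made-up numbers (the vectors below are hypothetical): with its defaults it orders the levels by the .y value found at the largest .x, in descending order, so the legend lines up with the right-hand end of the lines.

```r
library(forcats)

# Hypothetical toy data: two manufacturers observed on shelves 1 and 2
manuf     <- factor(c("A", "A", "B", "B"))
shelf     <- c(1, 2, 1, 2)
avg_sugar <- c(5, 1, 2, 8)

# At the largest shelf (2), B has avg_sugar 8 and A has 1, so B comes first
reordered <- fct_reorder2(manuf, .x = shelf, .y = avg_sugar)
levels(reordered)  # "B" "A"
```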

Lab 5: Factors in Data Visualizations

Danger

You will be required to use functions from the {forcats} package! e.g. reorder() is a no go, use fct_reorder() instead!

Date + Time Variables

library(lubridate)

Common Tasks

  • Convert a date-like variable (“May 8, 1995”) to a special DateTime Object.

  • Find the weekday, month, year, etc from a DateTime object

  • Convert between timezones
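
For the time zone task, a minimal sketch with two lubridate helpers, with_tz() and force_tz(). The example datetime is only an illustration.

```r
library(lubridate)

bday <- ymd_hms("1995-05-08 06:32:12", tz = "America/Chicago")

# with_tz(): the same instant, displayed in another time zone
bday_utc <- with_tz(bday, tzone = "UTC")
bday_utc     # "1995-05-08 11:32:12 UTC"

# force_tz(): the same clock time, relabelled with another time zone,
# which makes it a different instant
bday_forced <- force_tz(bday, tzone = "UTC")
bday_forced  # "1995-05-08 06:32:12 UTC"
```

Use with_tz() when the moment in time is right and only the display should change; use force_tz() when the clock time is right but the time zone was recorded incorrectly.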

Note

The package lubridate is AMAZING for this.

  • lubridate installs with the tidyverse. Since tidyverse 2.0.0 it also loads with library(tidyverse); with earlier versions, load it separately.
library(lubridate)

datetime Objects

There are actually three data types (classes) in R for dates and datetimes.

  • Date (duh)

  • POSIXlt (???)

  • and POSIXct (???)

History of POSIXlt and POSIXct

  • POSIXct – stores date/time values as the number of seconds since January 1, 1970 (“Unix Epoch”)

  • POSIXlt – stores date/time values as a list with elements for second, minute, hour, day, month, and year, among others.
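
The two storage schemes can be inspected directly in base R. This is a small sketch; the date is only an example.

```r
ct <- as.POSIXct("1995-05-08 00:00:00", tz = "UTC")

# POSIXct: a single number, seconds since 1970-01-01 00:00:00 UTC
as.numeric(ct)

# Date: a single number, days since 1970-01-01
as.numeric(as.Date("1995-05-08"))

# POSIXlt: a list of components
lt <- as.POSIXlt(ct)
lt$year + 1900  # years are stored as an offset from 1900
lt$mon + 1      # months are stored as 0-11
lt$mday         # day of the month
```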

In the “old days”, to make a Date or Datetime object, you’d have to get the format just right.

as.Date("1995-05-08") |>  str()
 Date[1:1], format: "1995-05-08"
as_datetime("1995-05-08") |>  str()
 POSIXct[1:1], format: "1995-05-08"

Enter lubridate!

make_date(year = 1995, month = 05, day = 08)
[1] "1995-05-08"
mdy("May 8, 1995")
[1] "1995-05-08"
dmy("8-May-1995", tz = "America/Chicago")
[1] "1995-05-08 CDT"
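# parse_datetime() comes from readr; its format argument needs
# strptime-style codes such as %m, %d, %Y
parse_datetime("05/8/1995", format = "mdy")
[1] NA
parse_datetime("5/8/1995", format = "%m/%d/%Y")
[1] "1995-05-08 UTC"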

Common mistake with dates

What is wrong with these two code chunks?

as_datetime(2023-02-6)
[1] "1970-01-01 00:33:35 UTC"


my_date <- 2023-02-6
my_date
[1] 2015
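
A short sketch of what goes wrong in both chunks: without quotes, R treats the dashes as subtraction, and as_datetime() then reads the resulting number as seconds after the Unix Epoch.

```r
library(lubridate)

2023 - 02 - 6              # unquoted, R does subtraction: 2015
as_datetime(2015)          # 2015 seconds after 1970-01-01 00:00:00 UTC

ymd("2023-02-06")          # quoted string parsed as a Date
as_datetime("2023-02-06")  # quoted string parsed as a POSIXct datetime
```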

Components of dates

bday <- ymd_hms("1995-05-8 6:32:12", tz = "America/Chicago")
bday
[1] "1995-05-08 06:32:12 CDT"


year(bday)
[1] 1995
month(bday)
[1] 5
day(bday)
[1] 8
wday(bday)
[1] 2
wday(bday, label = TRUE)
[1] Mon
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

When is my ______ birthday?


next birthday…

(bday + years(28)) |>  
  wday(label = TRUE, abbr = FALSE)
[1] Monday
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday


hundredth…

bday + years(100)
[1] "2095-05-08 06:32:12 CDT"

PA 5.1: Zodiac Killer

One of the most famous mysteries in California history is the identity of the so-called “Zodiac Killer”, who murdered 7 people in Northern California between 1968 and 1969. A new murder was committed last year in California, suspected to be the work of a new Zodiac Killer on the loose.

Unfortunately, the date and time of the murder is not known. You have been hired to crack the case. Use the clues below to discover the murderer’s identity.

Submit the name of the killer to the Canvas Quiz.

To do…

  • PA 5.1: Zodiac Killer
    • Due Wednesday, 2/8 at 8:00am
  • Lab 5: Factors in Data Visualization
    • Due Friday, 2/10 at 11:59pm
  • Final Project Group Formation Survey
    • Due Friday, 2/10 at 11:59pm

Wednesday, February 8th

Today we will…

  • Review PA 5.1: Zodiac Killer
  • Midterm Exam 2/15: What to Expect
    • Example Game Plans
    • Example Open-ended Analysis
  • Mini lecture on text material
    • Strings
    • Regular Expressions
  • Example: “Messy” Covid Variants
  • PA 5.2: Scrambled Message

Bonus Challenge: Save the Date

What: Data Mishaps Night

When: Thursday, February 23rd at 5pm PST

Where: Zoom!

Danger

You must register for free here to receive the link!

  • Submit to Canvas a “hearty” paragraph reflection for each keynote and/or theme session you attend.
  • Each reflection gets you +2 challenge points (that is up to +8 for the whole night!)

Tip

Make sure to indicate with a header which theme your reflection is for. You can write this in any document editor of your choosing (Word, Quarto, Google Docs, etc.)

Midterm Exam – In-class Wednesday 2/15

Lab 4: Game Plans!

stringr

strings

A string is a bunch of characters.

Don’t confuse a string (many characters, one object) with a character vector (vector of strings).


my_string <- "Hi, my name is Bond!"
my_vector <- c("Hi", "my", "name", "is", "Bond")


my_string
[1] "Hi, my name is Bond!"


my_vector
[1] "Hi"   "my"   "name" "is"   "Bond"

stringr

Common tasks

  • Find which strings contain a particular pattern

  • Remove or replace a pattern

  • Edit a string (for example, make it lowercase)

Note

The package stringr is very useful for strings!

  • stringr loads with the tidyverse.

  • all the functions are str_xxx().

pattern =

The pattern argument in all of the stringr functions …

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")

str_detect(my_vector, pattern = "Bond")
str_locate(my_vector, pattern = "Bond")
str_match(my_vector, pattern = "Bond")
str_extract(my_vector, pattern = "Bond")
str_subset(my_vector, pattern = "Bond")

Note

Discuss with a neighbor. For each of these functions, give:

  • The object structure of the output.
  • The data type of the output.
  • A brief explanation of what they do.

str_detect()

Returns logical vector TRUE/FALSE indicating if the pattern was found in that element of the original vector

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_detect(my_vector, pattern = "Bond")
[1] FALSE FALSE  TRUE  TRUE
  • Pairs well with filter()
  • Could be used with summarise() and sum or mean
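
A sketch of both pairings on a made-up tibble (cereal_names is a hypothetical stand-in for a real data set): filter() keeps the rows where the pattern is found, and because TRUE counts as 1, sum() gives the number of matches and mean() the proportion.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data for illustration
cereal_names <- tibble(
  name = c("All-Bran", "Cheerios", "Raisin Bran", "Trix")
)

# Keep only the rows whose name contains "Bran"
bran_only <- cereal_names |>
  filter(str_detect(name, pattern = "Bran"))

# Count and proportion of names containing "Bran"
bran_counts <- cereal_names |>
  summarise(
    n_bran    = sum(str_detect(name, pattern = "Bran")),
    prop_bran = mean(str_detect(name, pattern = "Bran"))
  )
```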

Related functions

str_subset() returns just the strings that contain the match

str_which() returns the indexes of strings that have a match

str_match()

Returns character matrix with either NA or the pattern, depending on if the pattern was found.

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_match(my_vector, pattern = "Bond")
     [,1]  
[1,] NA    
[2,] NA    
[3,] "Bond"
[4,] "Bond"

str_extract()

Returns character vector with either NA or the pattern, depending on if the pattern was found.

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_extract(my_vector, pattern = "Bond")
[1] NA     NA     "Bond" "Bond"

Warning

str_extract() only returns the first pattern match; use str_extract_all() to return every pattern match.
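
A small sketch of the difference, using a made-up string that contains the pattern twice:

```r
library(stringr)

line <- "Bond, James Bond"

first_match <- str_extract(line, pattern = "Bond")      # "Bond" (first match only)
all_matches <- str_extract_all(line, pattern = "Bond")  # a list: one element per input string

all_matches[[1]]  # "Bond" "Bond"
```

Because str_extract_all() returns a list, it usually needs an extra step (such as indexing with [[ ]]) before it can be used as a column.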

str_locate()

Returns an integer matrix with two columns, start and end, giving either NA or the start and end position of the first match of the pattern.

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_locate(my_vector, pattern = "Bond")
     start end
[1,]    NA  NA
[2,]    NA  NA
[3,]     1   4
[4,]     7  10

str_subset()

Returns a character vector with a subset of the original character vector with elements where the pattern occurs.

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_subset(my_vector, pattern = "Bond")
[1] "Bond"       "James Bond"

Related Functions

str_sub() extracts values based on location.
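
For example, a sketch of str_sub() with positive and negative positions (the string is made up for illustration):

```r
library(stringr)

agent <- "James Bond"

first_name <- str_sub(agent, start = 1, end = 5)    # "James"
last_name  <- str_sub(agent, start = -4, end = -1)  # "Bond"; negative positions count from the end
```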

Replace / Remove patterns

str_replace(x, pattern = "", replacement = "")

Replaces the first matched pattern.

  • Pairs well with mutate()
str_replace(my_vector, pattern = "Bond", replacement = "Franco")
[1] "Hello,"       "my name is"   "Franco"       "James Franco"

str_remove(x, pattern = "")

Removes the first matched pattern.

Special case of str_replace(x, pattern, replacement = "")

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_remove(my_vector, pattern = "Bond")
[1] "Hello,"     "my name is" ""           "James "    

Related functions

str_replace_all() replaces all matched patterns

str_remove_all() removes all matched patterns
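
A sketch contrasting the first-match and all-matches versions on a made-up string:

```r
library(stringr)

fruit <- "banana"

str_replace(fruit, pattern = "a", replacement = "o")      # "bonana"
str_replace_all(fruit, pattern = "a", replacement = "o")  # "bonono"

str_remove(fruit, pattern = "a")                          # "bnana"
str_remove_all(fruit, pattern = "a")                      # "bnn"
```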

Make edits

Convert letters in the string to a specific capitalization format.

str_to_lower() converts all letters in the strings to lowercase


my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_to_lower(my_vector)
[1] "hello,"     "my name is" "bond"       "james bond"

str_to_upper() converts all letters in the strings to uppercase


str_to_upper(my_vector)
[1] "HELLO,"     "MY NAME IS" "BOND"       "JAMES BOND"

str_to_title() converts the first letter of each word to uppercase and the rest to lowercase


str_to_title(my_vector)
[1] "Hello,"     "My Name Is" "Bond"       "James Bond"

Combine Strings

str_c() joins multiple strings into a single string.

prompt <- "Hello, my name is"
first  <- "James"
last   <- "Bond"
str_c(prompt, first, last, sep = " ")
[1] "Hello, my name is James Bond"

str_flatten() combines the elements of a character vector into a single string.

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_flatten(my_vector, collapse = " ")
[1] "Hello, my name is Bond James Bond"

Note

str_c(my_vector, collapse = " ") will do the same thing, but it is encouraged to use str_flatten() instead.

str_glue() creates a string using variables from the environment and evaluates {expressions}.

first <- "James"
last <- "Bond"
str_glue("My name is {last}, {first} {last}")
My name is Bond, James Bond

Tip

See the R package glue!

Hints and Tips for Success

  • Refer to the stringr cheatsheet

  • Remember that str_xxx functions need the first argument to be a vector of strings, not a data set.

    • You might want to use them inside functions like filter() or mutate().
cereal |> 
  mutate(
    is_bran = str_detect(name, "Bran"), 
    .after = name
  )
                                     name is_bran manuf type calories protein
1                               100% Bran    TRUE     N cold       70       4
2                       100% Natural Bran    TRUE     Q cold      120       3
3                                All-Bran    TRUE     K cold       70       4
4               All-Bran with Extra Fiber    TRUE     K cold       50       4
5                          Almond Delight   FALSE     R cold      110       2
6                 Apple Cinnamon Cheerios   FALSE     G cold      110       2
7                             Apple Jacks   FALSE     K cold      110       2
8                                 Basic 4   FALSE     G cold      130       3
9                               Bran Chex    TRUE     R cold       90       2
10                            Bran Flakes    TRUE     P cold       90       3
11                           Cap'n'Crunch   FALSE     Q cold      120       1
12                               Cheerios   FALSE     G cold      110       6
13                  Cinnamon Toast Crunch   FALSE     G cold      120       1
14                               Clusters   FALSE     G cold      110       3
15                            Cocoa Puffs   FALSE     G cold      110       1
16                              Corn Chex   FALSE     R cold      110       2
17                            Corn Flakes   FALSE     K cold      100       2
18                              Corn Pops   FALSE     K cold      110       1
19                          Count Chocula   FALSE     G cold      110       1
20                     Cracklin' Oat Bran    TRUE     K cold      110       3
21                 Cream of Wheat (Quick)   FALSE     N  hot      100       3
22                                Crispix   FALSE     K cold      110       2
23                 Crispy Wheat & Raisins   FALSE     G cold      100       2
24                            Double Chex   FALSE     R cold      100       2
25                            Froot Loops   FALSE     K cold      110       2
26                         Frosted Flakes   FALSE     K cold      110       1
27                    Frosted Mini-Wheats   FALSE     K cold      100       3
28 Fruit & Fibre Dates; Walnuts; and Oats   FALSE     P cold      120       3
29                          Fruitful Bran    TRUE     K cold      120       3
30                         Fruity Pebbles   FALSE     P cold      110       1
31                           Golden Crisp   FALSE     P cold      100       2
32                         Golden Grahams   FALSE     G cold      110       1
33                      Grape Nuts Flakes   FALSE     P cold      100       3
34                             Grape-Nuts   FALSE     P cold      110       3
35                     Great Grains Pecan   FALSE     P cold      120       3
36                       Honey Graham Ohs   FALSE     Q cold      120       1
37                     Honey Nut Cheerios   FALSE     G cold      110       3
38                             Honey-comb   FALSE     P cold      110       1
39            Just Right Crunchy  Nuggets   FALSE     K cold      110       2
40                 Just Right Fruit & Nut   FALSE     K cold      140       3
41                                    Kix   FALSE     G cold      110       2
42                                   Life   FALSE     Q cold      100       4
43                           Lucky Charms   FALSE     G cold      110       2
44                                  Maypo   FALSE     A  hot      100       4
45       Muesli Raisins; Dates; & Almonds   FALSE     R cold      150       4
46      Muesli Raisins; Peaches; & Pecans   FALSE     R cold      150       4
47                   Mueslix Crispy Blend   FALSE     K cold      160       3
48                   Multi-Grain Cheerios   FALSE     G cold      100       2
49                       Nut&Honey Crunch   FALSE     K cold      120       2
50              Nutri-Grain Almond-Raisin   FALSE     K cold      140       3
51                      Nutri-grain Wheat   FALSE     K cold       90       3
52                   Oatmeal Raisin Crisp   FALSE     G cold      130       3
53                  Post Nat. Raisin Bran    TRUE     P cold      120       3
54                             Product 19   FALSE     K cold      100       3
55                            Puffed Rice   FALSE     Q cold       50       1
56                           Puffed Wheat   FALSE     Q cold       50       2
57                     Quaker Oat Squares   FALSE     Q cold      100       4
58                         Quaker Oatmeal   FALSE     Q  hot      100       5
59                            Raisin Bran    TRUE     K cold      120       3
60                        Raisin Nut Bran    TRUE     G cold      100       3
61                         Raisin Squares   FALSE     K cold       90       2
62                              Rice Chex   FALSE     R cold      110       1
63                          Rice Krispies   FALSE     K cold      110       2
64                         Shredded Wheat   FALSE     N cold       80       2
65                 Shredded Wheat 'n'Bran    TRUE     N cold       90       3
66              Shredded Wheat spoon size   FALSE     N cold       90       3
67                                 Smacks   FALSE     K cold      110       2
68                              Special K   FALSE     K cold      110       6
69                Strawberry Fruit Wheats   FALSE     N cold       90       2
70                      Total Corn Flakes   FALSE     G cold      110       2
71                      Total Raisin Bran    TRUE     G cold      140       3
72                      Total Whole Grain   FALSE     G cold      100       3
73                                Triples   FALSE     G cold      110       2
74                                   Trix   FALSE     G cold      110       1
75                             Wheat Chex   FALSE     R cold      100       3
76                               Wheaties   FALSE     G cold      100       3
77                    Wheaties Honey Gold   FALSE     G cold      110       2
   fat sodium fiber carbo sugars potass vitamins shelf weight cups   rating
1    1    130  10.0   5.0      6    280       25     3   1.00 0.33 68.40297
2    5     15   2.0   8.0      8    135        0     3   1.00 1.00 33.98368
3    1    260   9.0   7.0      5    320       25     3   1.00 0.33 59.42551
4    0    140  14.0   8.0      0    330       25     3   1.00 0.50 93.70491
5    2    200   1.0  14.0      8     -1       25     3   1.00 0.75 34.38484
6    2    180   1.5  10.5     10     70       25     1   1.00 0.75 29.50954
7    0    125   1.0  11.0     14     30       25     2   1.00 1.00 33.17409
8    2    210   2.0  18.0      8    100       25     3   1.33 0.75 37.03856
9    1    200   4.0  15.0      6    125       25     1   1.00 0.67 49.12025
10   0    210   5.0  13.0      5    190       25     3   1.00 0.67 53.31381
11   2    220   0.0  12.0     12     35       25     2   1.00 0.75 18.04285
12   2    290   2.0  17.0      1    105       25     1   1.00 1.25 50.76500
13   3    210   0.0  13.0      9     45       25     2   1.00 0.75 19.82357
14   2    140   2.0  13.0      7    105       25     3   1.00 0.50 40.40021
15   1    180   0.0  12.0     13     55       25     2   1.00 1.00 22.73645
16   0    280   0.0  22.0      3     25       25     1   1.00 1.00 41.44502
17   0    290   1.0  21.0      2     35       25     1   1.00 1.00 45.86332
18   0     90   1.0  13.0     12     20       25     2   1.00 1.00 35.78279
19   1    180   0.0  12.0     13     65       25     2   1.00 1.00 22.39651
20   3    140   4.0  10.0      7    160       25     3   1.00 0.50 40.44877
21   0     80   1.0  21.0      0     -1        0     2   1.00 1.00 64.53382
22   0    220   1.0  21.0      3     30       25     3   1.00 1.00 46.89564
23   1    140   2.0  11.0     10    120       25     3   1.00 0.75 36.17620
24   0    190   1.0  18.0      5     80       25     3   1.00 0.75 44.33086
25   1    125   1.0  11.0     13     30       25     2   1.00 1.00 32.20758
26   0    200   1.0  14.0     11     25       25     1   1.00 0.75 31.43597
27   0      0   3.0  14.0      7    100       25     2   1.00 0.80 58.34514
28   2    160   5.0  12.0     10    200       25     3   1.25 0.67 40.91705
29   0    240   5.0  14.0     12    190       25     3   1.33 0.67 41.01549
30   1    135   0.0  13.0     12     25       25     2   1.00 0.75 28.02576
31   0     45   0.0  11.0     15     40       25     1   1.00 0.88 35.25244
32   1    280   0.0  15.0      9     45       25     2   1.00 0.75 23.80404
33   1    140   3.0  15.0      5     85       25     3   1.00 0.88 52.07690
34   0    170   3.0  17.0      3     90       25     3   1.00 0.25 53.37101
35   3     75   3.0  13.0      4    100       25     3   1.00 0.33 45.81172
36   2    220   1.0  12.0     11     45       25     2   1.00 1.00 21.87129
37   1    250   1.5  11.5     10     90       25     1   1.00 0.75 31.07222
38   0    180   0.0  14.0     11     35       25     1   1.00 1.33 28.74241
39   1    170   1.0  17.0      6     60      100     3   1.00 1.00 36.52368
40   1    170   2.0  20.0      9     95      100     3   1.30 0.75 36.47151
41   1    260   0.0  21.0      3     40       25     2   1.00 1.50 39.24111
42   2    150   2.0  12.0      6     95       25     2   1.00 0.67 45.32807
43   1    180   0.0  12.0     12     55       25     2   1.00 1.00 26.73451
44   1      0   0.0  16.0      3     95       25     2   1.00 1.00 54.85092
45   3     95   3.0  16.0     11    170       25     3   1.00 1.00 37.13686
46   3    150   3.0  16.0     11    170       25     3   1.00 1.00 34.13976
47   2    150   3.0  17.0     13    160       25     3   1.50 0.67 30.31335
48   1    220   2.0  15.0      6     90       25     1   1.00 1.00 40.10596
49   1    190   0.0  15.0      9     40       25     2   1.00 0.67 29.92429
50   2    220   3.0  21.0      7    130       25     3   1.33 0.67 40.69232
51   0    170   3.0  18.0      2     90       25     3   1.00 1.00 59.64284
52   2    170   1.5  13.5     10    120       25     3   1.25 0.50 30.45084
53   1    200   6.0  11.0     14    260       25     3   1.33 0.67 37.84059
54   0    320   1.0  20.0      3     45      100     3   1.00 1.00 41.50354
55   0      0   0.0  13.0      0     15        0     3   0.50 1.00 60.75611
56   0      0   1.0  10.0      0     50        0     3   0.50 1.00 63.00565
57   1    135   2.0  14.0      6    110       25     3   1.00 0.50 49.51187
58   2      0   2.7  -1.0     -1    110        0     1   1.00 0.67 50.82839
59   1    210   5.0  14.0     12    240       25     2   1.33 0.75 39.25920
60   2    140   2.5  10.5      8    140       25     3   1.00 0.50 39.70340
61   0      0   2.0  15.0      6    110       25     3   1.00 0.50 55.33314
62   0    240   0.0  23.0      2     30       25     1   1.00 1.13 41.99893
63   0    290   0.0  22.0      3     35       25     1   1.00 1.00 40.56016
64   0      0   3.0  16.0      0     95        0     1   0.83 1.00 68.23588
65   0      0   4.0  19.0      0    140        0     1   1.00 0.67 74.47295
66   0      0   3.0  20.0      0    120        0     1   1.00 0.67 72.80179
67   1     70   1.0   9.0     15     40       25     2   1.00 0.75 31.23005
68   0    230   1.0  16.0      3     55       25     1   1.00 1.00 53.13132
69   0     15   3.0  15.0      5     90       25     2   1.00 1.00 59.36399
70   1    200   0.0  21.0      3     35      100     3   1.00 1.00 38.83975
71   1    190   4.0  15.0     14    230      100     3   1.50 1.00 28.59278
72   1    200   3.0  16.0      3    110      100     3   1.00 1.00 46.65884
73   1    250   0.0  21.0      3     60       25     3   1.00 0.75 39.10617
74   1    140   0.0  13.0     12     25       25     2   1.00 1.00 27.75330
75   1    230   3.0  17.0      3    115       25     1   1.00 0.67 49.78744
76   1    200   3.0  17.0      3    110       25     1   1.00 1.00 51.59219
77   1    200   1.0  16.0      8     60       25     1   1.00 0.75 36.18756

regex

Regular Expressions

“Regexps are a very terse language that allow you to describe patterns in strings.”

R for Data Science

R uses “extended” regular expressions, which are common.

pattern = "REGEX GOES HERE"

Web app to test R regular expressions

Tip

Regular expressions are a reason to use stringr!

You might encounter gsub(), grep(), etc. from Base R.

Meta Characters . ^ $ \ | * + ? { } [ ] ( )

toung_twister <- c("She", "sells", "seashells", "by", "the", "seashore!")
toung_twister
[1] "She"       "sells"     "seashells" "by"        "the"       "seashore!"


. Represents any character

str_subset(toung_twister, pattern = ".ea")
[1] "seashells" "seashore!"


^ Looks at the beginning

str_subset(toung_twister, pattern = "^s")
[1] "sells"     "seashells" "seashore!"

$ Looks at the end

str_subset(toung_twister, pattern = "!$")
[1] "seashore!"
shells_str <- c("shes", "shels", "shells", "shellls", "shelllls")
shells_str
[1] "shes"     "shels"    "shells"   "shellls"  "shelllls"


? Occurs 0 or 1 times

str_subset(shells_str, pattern = "shel?s")
[1] "shes"  "shels"

+ Occurs 1 or more times

str_subset(shells_str, pattern = "shel+s")
[1] "shels"    "shells"   "shellls"  "shelllls"

* Occurs 0 or more times

str_subset(shells_str, pattern = "shel*s")
[1] "shes"     "shels"    "shells"   "shellls"  "shelllls"


{n} matches exactly n times.

str_subset(shells_str, pattern = "shel{2}s")
[1] "shells"

{n,} matches at least n times.

str_subset(shells_str, pattern = "shel{2,}s")
[1] "shells"   "shellls"  "shelllls"

{n,m} matches between n and m times.

str_subset(shells_str, pattern = "shel{1,3}s")
[1] "shels"   "shells"  "shellls"

Groups ()

Groups can be created with ( )

| – “either” / “or”


toung_twister2 <- c("Peter", "Piper", "picked", "a", "peck", "of", "pickled", "peppers!")
toung_twister2
[1] "Peter"    "Piper"    "picked"   "a"        "peck"     "of"       "pickled" 
[8] "peppers!"


str_subset(toung_twister2, pattern = "p(e|i)ck")
[1] "picked"  "peck"    "pickled"

Character Classes []

str_subset(toung_twister2, pattern = "p[ei]ck")
[1] "picked"  "peck"    "pickled"

[^ ] except - think “not”

str_subset(toung_twister2, pattern = "p[^i]ck")
[1] "peck"

[ - ] range

str_subset(toung_twister2, pattern = "p[ei]ck[a-z]")
[1] "picked"  "pickled"

[Pp] Capitalization matters

str_subset(toung_twister2, pattern = "^p")
[1] "picked"   "peck"     "pickled"  "peppers!"
str_subset(toung_twister2, pattern = "^[Pp]")
[1] "Peter"    "Piper"    "picked"   "peck"     "pickled"  "peppers!"

[] Character Classes

  • [A-Z] matches any capital letter.
  • [a-z] matches any lowercase letter.
  • [A-Za-z] or [:alpha:] matches any letter. ([A-z] also matches the punctuation characters that sit between Z and a in ASCII.)
  • [0-9] or [:digit:] matches any digit.
  • See the stringr cheatsheet for more shortcuts, like [:punct:]
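A range such as `[A-z]` is sometimes used as shorthand for "any letter", but ranges follow character-code order, and a few punctuation characters (including `^` and `_`) sit between `Z` and `a`. The small check below uses a made-up test vector, `chars`, to show the difference.

```r
library(stringr)

# Hypothetical test vector: two letters, two punctuation marks, one digit.
chars <- c("a", "Z", "_", "^", "5")

# [A-z] spans everything from "A" to "z", punctuation included.
wide_range <- str_detect(chars, pattern = "[A-z]")

# [A-Za-z] lists the two letter ranges separately.
letters_only <- str_detect(chars, pattern = "[A-Za-z]")

# Digits are matched by [0-9].
digits_only <- str_detect(chars, pattern = "[0-9]")
```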

\w Looks for any “word” character: a letter, digit, or underscore (conversely, “not” a word character is \W)

\d Looks for any digit (conversely “not” digit \D)

\s Looks for any whitespace (conversely “not” whitespace \S)
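Inside an R string the backslash itself has to be doubled, so `\d` is typed as `"\\d"`. A minimal sketch on a made-up vector, `x`:

```r
library(stringr)

# Hypothetical test vector.
x <- c("Stat 331", "week5", "zodiac", "  ")

# \d: contains at least one digit
has_digit <- str_subset(x, pattern = "\\d")

# \s: contains at least one whitespace character
has_space <- str_subset(x, pattern = "\\s")

# ^\w+$: made up of word characters only, start to end
only_word_chars <- str_subset(x, pattern = "^\\w+$")

# \S: contains at least one NON-whitespace character
has_non_space <- str_subset(x, pattern = "\\S")
```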

Let’s try it out!

Write regular expressions that search for words that do the following:

  • end with a vowel
  • start with x, y, or z
  • do not contain x, y, or z
  • contain British spelling (e.g. color vs colour)

Test your answers out on

test_vec <- c("zebra", "xray", "apple", "yellow", "grey", "gray")

Escape \

To match a special character, you need to “escape” it first.

Warning

In general, look at punctuation characters with suspicion.

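toung_twister3 <- c("How", "much", "wood", "could", "a", "woodchuck", "chuck", "if", "a", "woodchuck", "could", "chuck", "wood?")
toung_twister3
 [1] "How"       "much"      "wood"      "could"     "a"         "woodchuck"
 [7] "chuck"     "if"        "a"         "woodchuck" "could"     "chuck"    
[13] "wood?"    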
str_subset(toung_twister3, pattern = "?")
Error in stri_subset_regex(string, pattern, omit_na = TRUE, negate = negate, : Syntax error in regex pattern. (U_REGEX_RULE_SYNTAX, context=`?`)
str_subset(toung_twister3, pattern = "\\?")
[1] "wood?"

Note

Could also use [] character class

str_subset(toung_twister3, pattern = "[?]")
[1] "wood?"
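The same issue comes up with `.`, which fails silently instead of throwing an error: it matches any character. The sketch below uses a made-up vector of file names and also shows stringr's `fixed()`, which treats the whole pattern as literal text.

```r
library(stringr)

# Hypothetical file names.
files <- c("lab5.qmd", "lab5_qmd", "lab5-notes.txt")

# Unescaped "." matches ANY character, so "lab5_qmd" sneaks in.
unescaped <- str_subset(files, pattern = "5.qmd")

# Escaped: "\\." matches a literal period only.
escaped <- str_subset(files, pattern = "5\\.qmd")

# Character class: "[.]" also matches a literal period.
in_class <- str_subset(files, pattern = "5[.]qmd")

# fixed() turns off regex interpretation for the whole pattern.
literal <- str_subset(files, pattern = fixed("5.qmd"))
```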

When in Doubt



Use the web app to test R regular expressions

Tips for working with regex

  • Read the regular expressions out loud like a “request”

str_view() and str_view_all()

shells_str
[1] "shes"     "shels"    "shells"   "shellls"  "shelllls"
str_view(shells_str, "l+")
[2] │ she<l>s
[3] │ she<ll>s
[4] │ she<lll>s
[5] │ she<llll>s
str_view_all(shells_str, "l+")
[1] │ shes
[2] │ she<l>s
[3] │ she<ll>s
[4] │ she<lll>s
[5] │ she<llll>s

Tips for working with regex

  • Everyone has a love-hate relationship with regular expressions. Be kind to yourself.

strings in the tidyverse

matches(pattern)

Selects all variables with a name that matches the supplied pattern

  • pairs well with select(), rename_with(), and across()
military_clean <- military |> 
  mutate(across(`1988`:`2019`, 
                ~ na_if(.x, y = ". .")
                ),
         across(`1988`:`2019`, 
                ~ na_if(.x, y = "xxx")
                )
         )
military_clean <- military |> 
  mutate(
         across(matches("[1-9]"), 
                ~ na_if(.x, y = ". .")
                ),
         across(matches("[1-9]"), 
                ~ na_if(.x, y = "xxx")
                )
         )
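The `military` data is not included here, so the sketch below uses a tiny made-up stand-in, `military_toy`, to show the same idea end to end. It also swaps in a stricter pattern, `"^[0-9]{4}$"`, as one possible alternative: `"[1-9]"` matches any column name that contains a digit anywhere, while the anchored version only matches names made of exactly four digits.

```r
library(dplyr)

# Hypothetical stand-in for the military data (values are made up).
military_toy <- tibble(
  Country = c("A", "B"),
  `1988`  = c("1.2", ". ."),
  `1989`  = c("xxx", "3.4")
)

military_toy_clean <- military_toy |>
  mutate(
    # Columns whose names are exactly four digits: ". ." becomes NA
    across(matches("^[0-9]{4}$"), ~ na_if(.x, ". .")),
    # Same columns: "xxx" becomes NA
    across(matches("^[0-9]{4}$"), ~ na_if(.x, "xxx"))
  )

military_toy_clean
```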

“Messy” Covid Variants

The other day, a grad school colleague sent me this data and asked if I knew how to “clean” it.

What is that column?! 😮

[{'variant': 'Other', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 4.59}, {'variant': 'V-20DEC-01 (Alpha)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}, {'variant': 'V-21APR-02 (Delta B.1.617.2)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}, {'variant': 'V-21OCT-01 (Delta AY 4.2)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}, {'variant': 'V-22DEC-01 (Omicron CH.1.1)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 24.56}, {'variant': 'V-22JUL-01 (Omicron BA.2.75)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 8.93}, {'variant': 'V-22OCT-01 (Omicron BQ.1)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 49.57}, {'variant': 'VOC-21NOV-01 (Omicron BA.1)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 0.02}, {'variant': 'VOC-22APR-03 (Omicron BA.4)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 0.08}, {'variant': 'VOC-22APR-04 (Omicron BA.5)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 5.59}, {'variant': 'VOC-22JAN-01 (Omicron BA.2)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 1.41}, {'variant': 'unclassified_variant', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 5.26}]

Enter stringr! 🎉

Let’s see how this works.
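One possible approach, sketched on a shortened copy of the string above (only the first two entries). This is not necessarily the approach taken in class. It uses `str_match_all()`, whose result is a list of matrices where column 2 holds the text captured by the first `( )` group. The object names are made up for illustration.

```r
library(stringr)

# Shortened copy of the messy column (first two entries only).
variants_raw <- "[{'variant': 'Other', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 4.59}, {'variant': 'V-20DEC-01 (Alpha)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}]"

# Capture everything between the quotes that follow 'variant':
# [^']+ reads "one or more characters that are not a single quote".
variant_names <- str_match_all(
  variants_raw,
  pattern = "'variant': '([^']+)'"
)[[1]][, 2]

# Capture the number that follows 'newWeeklyPercentage':
percentages <- str_match_all(
  variants_raw,
  pattern = "'newWeeklyPercentage': ([0-9.]+)"
)[[1]][, 2] |>
  as.numeric()

variant_names
percentages
```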

PA 5.2: Scrambled Message

In this activity, you will be using regular expressions to decode a message.

toung_twister3
 [1] "How"       "much"      "wood"      "could"     "a"         "woodchuck"
 [7] "chuck"     "if"        "a"         "woodchuck" "could"     "chuck"    
[13] "wood?"    
  • You can grab elements out of a vector with [ ], which you can read as “where”.
toung_twister3[c(3, 6, 10, 13)]
[1] "wood"      "woodchuck" "woodchuck" "wood?"    
  • To replace or change those elements, use the assignment arrow!
toung_twister3[c(3, 6, 10, 13)] <- "WOOD!"
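The positions do not have to be typed by hand. As a sketch, `str_detect()` returns a logical vector that can go inside `[ ]` directly. The names `is_wood` and `wood_positions` are made up for illustration.

```r
library(stringr)

toung_twister3 <- c("How", "much", "wood", "could", "a", "woodchuck",
                    "chuck", "if", "a", "woodchuck", "could", "chuck",
                    "wood?")

# TRUE wherever the element contains "wood"
is_wood <- str_detect(toung_twister3, pattern = "wood")

# which() converts the TRUEs to positions
wood_positions <- which(is_wood)

# Replace the elements "where" is_wood is TRUE
toung_twister3[is_wood] <- "WOOD!"

wood_positions
toung_twister3
```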

To do…

  • PA 5.2: Scrambled Message
    • Due Friday, 2/10 at 8:00am
  • Lab 5: Factors in Data Visualization
    • Due Friday, 2/10 at 11:59pm
  • Final Project Group Formation Survey
    • Due Friday, 2/10 at 11:59pm
  • Bonus Challenge: Murder Mystery in SQL City
    • Due Sunday 2/12 at 11:59pm
  • Read Chapter 6: Version Control
    • Concept Check 6.1 + 6.2 due Monday (2/13) at 8:00am